Abstract

In the crystal basis model of the genetic code, it is deduced that the sum of usage probabilities 
of the codons with C and A in the third position for the quartets and/or sextets is independent 
of the biological species for vertebrates. A comparison with experimental data shows that the 
prediction is satisfied within about 5 %. 
The genetic code is degenerate: almost all the amino acids are encoded by multiple codons.
The degeneracy is found primarily in the third position of the codon, i.e. the nucleotide in
the third position can change without changing the corresponding amino acid. Some codons are
used much more frequently than others to encode a particular amino acid. The pattern of codon
usage varies between species and even among tissues within a species [1-7]. The case of bacteria
has been widely studied [?]. To our knowledge, systematic studies for eukaryotes are rather
fragmentary [?]. Most analyses of codon usage frequencies have addressed the relative abundance
of a specified codon, or the comparison of relative abundances in different biological species
and/or genes. Little attention has been paid to correlations between codon usage frequencies
among different biological species. Indeed, a correlation between suitable ratios of codon usage
frequencies has been remarked in [?] for biological species belonging to the vertebrate class
and in [13] for biological organisms including plants. Such correlations fit well in a
mathematical model of the genetic code, called the crystal basis model, proposed by the authors
in [14]. Moreover, in [15] it has also been observed that the ratio of the previously defined
quantities exhibits, for vertebrates, an almost universal behaviour, i.e. independent of the
biological species and of the nature of the amino acid, for the subset of amino acids encoded by
quartets or sextets. These remarks suggest the possible existence of a general bias in the usage
frequency of a specific codon belonging to a quartet or to the quartet sub-part of a sextet, i.e.
the four codons differing only in the last nucleotide. The aim of this paper is to investigate
this aspect and to predict a general law which should be satisfied by all the biological species
belonging to the vertebrates.

Let us define the usage probability for the codon XZN (X, Z, N ∈ {A, C, G, U}) as

$$ P(XZN) = \lim_{n_{tot} \to \infty} \frac{n_{XZN}}{n_{tot}} \tag{1} $$

where $n_{XZN}$ is the number of times the codon XZN has been used in the biosynthesis process of
the corresponding amino acid and $n_{tot}$ is the total number of codons used to synthesise this
amino acid. It follows that our analysis and predictions hold for biological species with
sufficiently large codon statistics. In our model each codon XZN is described by a state belonging
to an irreducible representation (irrep.), denoted $(J_H, J_V)^\xi$, of the algebra
$U_q(sl(2)_H \oplus sl(2)_V)$ in the limit $q \to 0$ (so-called crystal basis); $J_H$, $J_V$ take
(half-)integer values and the upper label $\xi$ removes the degeneracy when the same couple of
values of $J_H$, $J_V$ appears several times. As can be seen in Table 1, there are for example four
representations $(1/2, 1/2)^\xi$, with $\xi = 1, 2, 3, 4$. Finally, within a given representation
$(J_H, J_V)$ two more quantum numbers $J_{H,3}$, $J_{V,3}$ are necessary to specify a particular
state: see Table 1, which is reported here to make the paper self-contained.
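
As an illustration of eq. (1), the following minimal sketch estimates usage probabilities within one synonymous family from a finite list of codons. The function name `usage_probabilities`, the tuple `PRO_QUARTET` and the toy counts are assumptions introduced here for illustration only; they are not data from the paper, and a finite sample can only approximate the limit in eq. (1).

```python
from collections import Counter

# Codons of the Pro quartet (CCN); any other synonymous family could be used.
PRO_QUARTET = ("CCU", "CCC", "CCA", "CCG")


def usage_probabilities(codons, family):
    """Finite-sample estimate of P(XZN) = n_XZN / n_tot within one codon family."""
    counts = Counter(c for c in codons if c in family)
    n_tot = sum(counts.values())
    if n_tot == 0:
        raise ValueError("no codon of this family in the sample")
    return {codon: counts[codon] / n_tot for codon in family}


# Hypothetical toy sample: 100 Pro codons plus 10 Ala codons, which are ignored.
sample = ["CCU"] * 28 + ["CCC"] * 33 + ["CCA"] * 26 + ["CCG"] * 13 + ["GCC"] * 10
probs = usage_probabilities(sample, PRO_QUARTET)

# Sum over the codons ending in C and A, the quantity studied in the text.
p_c_plus_a = probs["CCC"] + probs["CCA"]
print(probs, p_c_plus_a)
```

The probabilities are normalised within the family, not over the whole sample, which is the convention the text uses when it states that the probabilities of one quadruplet sum to one.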

So it is natural, in the crystal basis model, to write the usage probability as a function of the
biological species (b.s.), of the particular amino acid and of the labels $J_H$, $J_V$, $J_{H,3}$,
$J_{V,3}$ describing the state XZN. Here we suppose that the dependence on the amino acid is
completely determined by the set of labels J, and so we write

$$ P(XZN) = P(b.s.;\, J_H, J_V, J_{H,3}, J_{V,3}) \tag{2} $$

Let us now make the hypothesis that we can write the r.h.s. of eq. (2) as the sum of two
contributions: a universal function $\rho$ independent of the biological species and a
b.s.-dependent function $f_{bs}$, i.e.

$$ P(XZN) = \rho^{XZ}(J_H, J_V, J_{H,3}, J_{V,3}) + f^{XZ}_{bs}(J_H, J_V, J_{H,3}, J_{V,3}) \tag{3} $$



As suggested by an analysis of the available data [13], the contribution of $f_{bs}$ is not
negligible but could be smaller than the one due to $\rho$. We will also assume

$$ f^{XZ}_{bs}(J_H, J_V, J_{H,3}, J_{V,3}) \approx F^{XZ}_{bs}(J_H; J_{H,3}) + G^{XZ}_{bs}(J_V; J_{V,3}) \tag{4} $$

Now, let us analyse in the light of the above considerations the usage probability for the quartets 
Ala, Gly, Pro, Thr and Val and for the quartet sub-part of the sextets Arg (i.e. the codons of the 
form CGN), Leu (i.e. CUN) and Ser (i.e. UCN). 

For Thr, Pro, Ala and Ser we can write, using Table 1 and eqs. (2)-(4), with N = A, C, G, U,

$$ P(NCC) + P(NCA) = \rho^{NC}_{C+A} + F^{NC}_{bs}(\tfrac{3}{2}; x) + G^{NC}_{bs}(\tfrac{3}{2}; y) + F^{NC}_{bs}(\tfrac{1}{2}; x') + G^{NC}_{bs}(\tfrac{1}{2}; y') \tag{5} $$

where we have denoted by $\rho^{NC}_{C+A}$ the sum of the contributions of the universal function
$\rho$ (i.e. not depending on the biological species) relative to NCC and NCA, while the labels
x, y, x', y' depend on the nature of the first two nucleotides NC, see Table 1. For the same amino
acid we can also write

$$ P(NCG) + P(NCU) = \rho^{NC}_{G+U} + F^{NC}_{bs}(\tfrac{3}{2}; x) + G^{NC}_{bs}(\tfrac{3}{2}; y) + F^{NC}_{bs}(\tfrac{1}{2}; x') + G^{NC}_{bs}(\tfrac{1}{2}; y') \tag{6} $$

Using the results of Table 1, we can remark that the difference between eq. (5) and eq. (6) is a
quantity independent of the biological species,

$$ P(NCC) + P(NCA) - P(NCG) - P(NCU) = \rho^{NC}_{C+A} - \rho^{NC}_{G+U} = \mathrm{Const.} \tag{7} $$

In the same way, considering the cases of Leu, Val, Arg and Gly, we obtain, with W = C, G,

$$ P(WUC) + P(WUA) - P(WUG) - P(WUU) = \rho^{WU}_{C+A} - \rho^{WU}_{G+U} = \mathrm{Const.} \tag{8} $$

$$ P(CGC) + P(CGA) - P(CGG) - P(CGU) = \rho^{CG}_{C+A} - \rho^{CG}_{G+U} = \mathrm{Const.} \tag{9} $$

$$ P(GGC) + P(GGA) - P(GGG) - P(GGU) = \rho^{GG}_{C+A} - \rho^{GG}_{G+U} = \mathrm{Const.} \tag{10} $$
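
The cancellation behind eqs. (5)-(10) can be checked mechanically from the labels of Table 1. The sketch below assumes the factorised form of eq. (4): each codon contributes one F term with arguments $(J_H; J_{H,3})$ and one G term with arguments $(J_V; J_{V,3})$. The names `QUARTETS`, `fg_terms` and `species_terms_cancel` are illustrative, and the labels are stored as twice their value so that only integers appear.

```python
from collections import Counter

# (2*J_H, 2*J_V, 2*J_H3, 2*J_V3) for the third nucleotides C, U, G, A,
# transcribed from Table 1 for the eight quartets / quartet sub-parts.
QUARTETS = {
    "CC": {"C": (3, 3, 3, 3), "U": (1, 3, 1, 3), "G": (3, 1, 3, 1), "A": (1, 1, 1, 1)},
    "UC": {"C": (3, 3, 1, 3), "U": (1, 3, -1, 3), "G": (3, 1, 1, 1), "A": (1, 1, -1, 1)},
    "GC": {"C": (3, 3, 3, 1), "U": (1, 3, 1, 1), "G": (3, 1, 3, -1), "A": (1, 1, 1, -1)},
    "AC": {"C": (3, 3, 1, 1), "U": (1, 3, -1, 1), "G": (3, 1, 1, -1), "A": (1, 1, -1, -1)},
    "CU": {"C": (1, 3, 1, 3), "U": (1, 3, -1, 3), "G": (1, 1, 1, 1), "A": (1, 1, -1, 1)},
    "GU": {"C": (1, 3, 1, 1), "U": (1, 3, -1, 1), "G": (1, 1, 1, -1), "A": (1, 1, -1, -1)},
    "CG": {"C": (3, 1, 3, 1), "U": (1, 1, 1, 1), "G": (3, 1, 3, -1), "A": (1, 1, 1, -1)},
    "GG": {"C": (3, 3, 3, -1), "U": (1, 3, 1, -1), "G": (3, 3, 3, -3), "A": (1, 3, 1, -3)},
}


def fg_terms(labels, thirds):
    """Multisets of F arguments (J_H; J_H3) and G arguments (J_V; J_V3)."""
    f_args = Counter((labels[n][0], labels[n][2]) for n in thirds)
    g_args = Counter((labels[n][1], labels[n][3]) for n in thirds)
    return f_args, g_args


def species_terms_cancel(labels, pair):
    """True if the F and G terms of `pair` equal those of the complementary pair."""
    rest = [n for n in "ACGU" if n not in pair]
    return fg_terms(labels, pair) == fg_terms(labels, rest)


# The C+A versus G+U split used in eqs. (7)-(10).
balanced = {xz: species_terms_cancel(lab, "CA") for xz, lab in QUARTETS.items()}
print(balanced)
```

For the split C+A versus G+U the two multisets coincide for all eight dinucleotides, so the species-dependent terms drop out of the difference. Calling `species_terms_cancel(QUARTETS["CC"], "CU")` returns `False`, which illustrates that the same cancellation does not occur for the C+U pairing.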



Since the probabilities for one quadruplet are normalised to one, from eqs. (7)-(10) we deduce
that for all the eight amino acids the sum of the usage probabilities of the codons whose last
nucleotide is A or C (or U or G) is independent of the biological species, i.e.

$$ P(XZC) + P(XZA) = \mathrm{Const.} \qquad (XZ = NC, CU, GU, CG, GG) \tag{11} $$
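
The step from eqs. (7)-(10) to eq. (11) can be written out explicitly. The block below is a short worked derivation using only the normalisation of the quadruplet and the species-independent difference; the symbol $c^{XZ}$ is shorthand introduced here for that difference.

```latex
\begin{align}
P(XZC) + P(XZA) + P(XZG) + P(XZU) &= 1 , \\
P(XZC) + P(XZA) - P(XZG) - P(XZU) &= \rho^{XZ}_{C+A} - \rho^{XZ}_{G+U} \equiv c^{XZ} .
\end{align}
% Adding and subtracting the two lines gives
\begin{align}
P(XZC) + P(XZA) &= \tfrac{1}{2}\,\bigl(1 + c^{XZ}\bigr) , \\
P(XZG) + P(XZU) &= \tfrac{1}{2}\,\bigl(1 - c^{XZ}\bigr) .
\end{align}
```

Since $c^{XZ}$ does not depend on the biological species, neither does either sum.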

In order to check the sum rules of eq. (11) against experimental data, we have considered the data
from GenBank (release 127.0 of Dec. 2001) for vertebrates with codon statistics larger than
60,000, see Table 2. In Table 3 we report the experimental data for the selected 21 vertebrates.
The comparison shows that our predictions are verified within 3-6 %, which is a striking result.
Moreover, if one considers only the species with the highest statistics (No. 1, 2, 3 in Table 2),
the prediction is verified within about 3 %.

In Table 4 we report the mean value and the standard deviation of the usage probability of the
codons XZN (XZ = NC, CU, GU, CG, GG) computed over all the biological species given in Table 2.
This probability shows a rather large spread, which is surprisingly reduced when one takes the
sum (compare Tables 4 and 5).
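
One way to read this reduction of the spread is through the identity Var(a + b) = Var(a) + Var(b) + 2 Cov(a, b): a sum can fluctuate less than its terms only if the terms are anticorrelated. The sketch below applies this identity to the rounded standard deviations of Tables 4 and 5. It is an illustration added here, not a computation from the paper; the function name `implied_correlation` is hypothetical, and the result is only indicative because the table entries are rounded.

```python
def implied_correlation(sd_c, sd_a, sd_sum):
    """Correlation of P(XZC) and P(XZA) implied by Var(a+b) = Var(a) + Var(b) + 2 Cov."""
    cov = (sd_sum ** 2 - sd_c ** 2 - sd_a ** 2) / 2.0
    return cov / (sd_c * sd_a)


# (sigma of P(XZC), sigma of P(XZA)) read from Table 4, sigma of the sum from Table 5.
TABLE_VALUES = {
    "Pro": (0.043, 0.034, 0.020),
    "Ala": (0.046, 0.035, 0.027),
    "Leu": (0.018, 0.017, 0.012),
}

correlations = {aa: implied_correlation(*sds) for aa, sds in TABLE_VALUES.items()}
for aa, r in correlations.items():
    print(f"{aa}: implied correlation {r:+.2f}")
```

For these three amino acids the implied correlation is strongly negative: an excess of the C-ending codon in a species tends to be compensated by a deficit of the A-ending one.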

Moreover, if we assume that for sextets the functions F and G really depend on the nature of the
encoded amino acid rather than on the dinucleotide, from the analysis of the content in irreps of
Table 1 we derive, in a completely analogous way, that for the amino acid Ser the sum
$P'_{C+A}(S) = P(UCA) + P(AGC)$ is independent of the biological species (note that we have
normalised the probabilities by P(UCA) + P(AGC) + P(UCG) + P(AGU) = 1). The experimental data are
in good agreement with this a priori surprising result, see the last column $P'_{C+A}(S)$ of
Table 3.

In Table 5 we report the statistical estimators (mean value, standard deviation and their ratio)
for the probabilities given in eq. (11). It should be remarked that essentially two species,
Danio rerio (zebrafish) and Pan troglodytes (chimpanzee), differ appreciably from the average value
for most amino acids. For the chimpanzee, this may be a statistical fluctuation due to the
relatively low number of codons.
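
The estimators of Table 5 can be recomputed from the columns of Table 3. The sketch below does so for the Pro and Thr columns and flags the species lying more than two standard deviations from the mean. The use of the population standard deviation (`pstdev`) and the threshold of two standard deviations are assumptions made here; the text does not state which estimator or cut was used, and the inputs are the two-digit rounded entries of Table 3.

```python
from statistics import mean, pstdev

# P_C+A for species 1..21, transcribed from the Pro and Thr columns of Table 3.
PRO = [0.60, 0.59, 0.59, 0.60, 0.60, 0.60, 0.53, 0.61, 0.60, 0.61, 0.61,
       0.58, 0.61, 0.62, 0.61, 0.58, 0.62, 0.56, 0.61, 0.61, 0.58]
THR = [0.64, 0.65, 0.65, 0.62, 0.62, 0.65, 0.61, 0.64, 0.63, 0.65, 0.64,
       0.61, 0.66, 0.68, 0.66, 0.66, 0.72, 0.63, 0.65, 0.64, 0.66]


def estimators(values):
    """Return (mean, standard deviation, their ratio) for one column."""
    m = mean(values)
    s = pstdev(values)  # assumption: population standard deviation
    return m, s, s / m


def outliers(values, k=2.0):
    """Species numbers (1-based, as in Table 2) further than k sigma from the mean."""
    m, s, _ = estimators(values)
    return [i for i, v in enumerate(values, start=1) if abs(v - m) > k * s]


for name, column in (("Pro", PRO), ("Thr", THR)):
    m, s, ratio = estimators(column)
    print(f"{name}: mean={m:.3f} sd={s:.3f} ratio={100 * ratio:.1f}% outliers={outliers(column)}")
```

With these inputs the only species flagged are No. 7 (Danio rerio) for Pro and No. 17 (Pan troglodytes) for Thr, in line with the two species singled out in the text.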

In order to evaluate how much more statistically significant our results are than generic ones, we
have computed the sums of the usage probabilities P(XZC) + P(XZU) and P(XZC) + P(XZG), where XZ
denotes the dinucleotides of the eight quartets/sextets. To save space we do not report here the
corresponding tables, but only the values of the statistical estimators, see Table 5. We see that
for P(XZC) + P(XZU) the standard deviation is a bit larger than the one for P(XZC) + P(XZA), and
even larger for P(XZC) + P(XZG). This result also fits in our model: in the first case the
probability differs from species-independent factors essentially through the presence of factors
$G_{bs}(J_V; J_{V,3})$, while in the last case it differs also through the presence of factors
$F_{bs}(J_H; J_{H,3})$.
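
The comparison of the three pairings can be summarised with a single number per pairing. The sketch below averages the ratios σ/x̄ of Table 5 over the eight amino acids; this average is a convenience introduced here and is not a quantity defined in the paper, and the dictionary name `RATIO_PERCENT` is illustrative.

```python
# Ratios sigma/mean in percent, transcribed from Table 5 in the column order
# P, A, T, S, V, L, R, G.
RATIO_PERCENT = {
    "C+A": [3.4, 4.4, 3.8, 2.6, 5.6, 3.7, 3.8, 2.5],
    "C+U": [4.9, 4.2, 4.4, 3.8, 7.2, 5.4, 9.7, 4.8],
    "C+G": [12.4, 11.3, 12.3, 9.6, 7.7, 6.3, 8.4, 10.2],
}

# Average relative spread of each pairing over the eight amino acids.
average = {pair: sum(vals) / len(vals) for pair, vals in RATIO_PERCENT.items()}

# Pairings ordered from smallest to largest average relative spread.
ranking = sorted(average, key=average.get)

for pair in ranking:
    print(f"P_{pair}: average sigma/mean = {average[pair]:.2f} %")
```

On average the C+A pairing has the smallest relative spread, followed by C+U and then C+G. The ordering holds for the averages, not for every amino acid individually: for Ala, for instance, Table 5 gives 4.4 % for C+A and 4.2 % for C+U.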

Now let us make two important remarks. 

- If we write for $\rho(J_H, J_V, J_{H,3}, J_{V,3})$, or equivalently for $\rho_{C+A}$, an
expression of the type (4), i.e. separating the H from the V dependence, it follows that the
r.h.s. of eqs. (5) and (6) are equal, and consequently the probabilities P(NCC) + P(NCA) and
P(NCU) + P(NCG) should be equal, which is not experimentally verified. This means that the
coupling term between H and V is not negligible for the function
$\rho(J_H, J_V, J_{H,3}, J_{V,3})$.

- Summing equations (5) and (6) we deduce that the expression
$F^{NC}_{bs}(\tfrac{3}{2}; x) + G^{NC}_{bs}(\tfrac{3}{2}; y) + F^{NC}_{bs}(\tfrac{1}{2}; x') + G^{NC}_{bs}(\tfrac{1}{2}; y')$
does not actually depend on the biological species. From eqs. (5) and (6) for different values of
N, and from the analogous equations for the other four quartets, we can derive relations between
sums of $F^{XZ}_{bs}$ and/or $G^{XZ}_{bs}$ functions which are independent of the biological
species.

Let us emphasise our claim: we have remarked that the sum of the usage probabilities of two
suitably chosen codons is, within a few percent, a constant independent of the biological species
for vertebrates, which fits well in the framework of the crystal basis model. Of course one can
restate the above result by saying that the sum of the usage probabilities of the codons XZC and
XZA does not depend on the nature of the biological species, without any reference to the crystal
basis model. However, a deeper analysis of Table 3 shows that $P_{C+A}$ for Pro, Thr, Ala, Ser and
Gly is of the order of 0.62, for Leu and Val of the order of 0.35, while for Arg it is of the
order of 0.52. In the crystal basis model the roots (i.e. the dinucleotides formed by the first
two nucleotides) of the first five amino acids belong to the same irrep. (1,1), the roots of Leu
and Val belong to the irrep. (0,1), while the root of Arg belongs to the irrep. (1,0). This is an
interesting result, especially for Pro, whose molecule has a different structure from the other
amino acids (Pro has an imino group instead of an amino group).
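
The grouping by dinucleotide irrep can be made explicit with the mean values of Table 5. The sketch below averages $P_{C+A}$ within each irrep named in the text; the dictionary names and the averaging within groups are illustrative choices made here, not part of the paper.

```python
from collections import defaultdict

# Irrep of the root dinucleotide, as stated in the text, keyed by amino-acid letter.
ROOT_IRREP = {
    "P": "(1,1)", "T": "(1,1)", "A": "(1,1)", "S": "(1,1)", "G": "(1,1)",
    "L": "(0,1)", "V": "(0,1)",
    "R": "(1,0)",
}

# Mean value of P_C+A over the 21 species, transcribed from Table 5.
MEAN_P_CA = {
    "P": 0.595, "A": 0.611, "T": 0.646, "S": 0.598,
    "V": 0.359, "L": 0.334, "R": 0.527, "G": 0.596,
}

# Collect the means of the amino acids sharing the same root irrep.
by_irrep = defaultdict(list)
for amino_acid, irrep in ROOT_IRREP.items():
    by_irrep[irrep].append(MEAN_P_CA[amino_acid])

group_mean = {irrep: sum(vals) / len(vals) for irrep, vals in by_irrep.items()}
for irrep, value in group_mean.items():
    print(f"irrep {irrep}: mean P_C+A = {value:.3f} over {len(by_irrep[irrep])} amino acids")
```

The three group means are clearly separated, while the values within the (1,1) and (0,1) groups differ from each other by only a few hundredths.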

It is natural to wonder what happens for other biological species. Green plants exhibit roughly
the same pattern, but a more reliable analysis should probably be performed by splitting them into
families. For invertebrates, the large number of existing biological species and the lack of data
with sufficient diversity prevent a similar analysis. The case of bacteria is rather interesting.
Eubacteria seem to avoid this pattern of correlations. This may be due to selection effects, which
may be stronger, or effective over shorter times, in less complicated species. For bacteria the
G+C content varies in a wide range, from 25 % to 75 %. Hence one could argue that biological
species with a large difference in G+C content exhibit a large difference in the correlation
pattern discussed in this paper. However, using the GenBank data, one finds for eubacteria no
correlation between the G+C content and the value of the probabilities of eq. (11).
Table 1: The eukaryotic or standard code. The upper label denotes different irreducible
representations.

codon  a.a.   (J_H, J_V)     J_H,3  J_V,3  |  codon  a.a.   (J_H, J_V)     J_H,3  J_V,3

CCC    Pro P  (3/2, 3/2)       3/2    3/2  |  UCC    Ser S  (3/2, 3/2)       1/2    3/2
CCU    Pro P  (1/2, 3/2)^1     1/2    3/2  |  UCU    Ser S  (1/2, 3/2)^1    -1/2    3/2
CCG    Pro P  (3/2, 1/2)^1     3/2    1/2  |  UCG    Ser S  (3/2, 1/2)^1     1/2    1/2
CCA    Pro P  (1/2, 1/2)^1     1/2    1/2  |  UCA    Ser S  (1/2, 1/2)^1    -1/2    1/2

CUC    Leu L  (1/2, 3/2)^2     1/2    3/2  |  UUC    Phe F  (3/2, 3/2)      -1/2    3/2
CUU    Leu L  (1/2, 3/2)^2    -1/2    3/2  |  UUU    Phe F  (3/2, 3/2)      -3/2    3/2
CUG    Leu L  (1/2, 1/2)^3     1/2    1/2  |  UUG    Leu L  (3/2, 1/2)^1    -1/2    1/2
CUA    Leu L  (1/2, 1/2)^3    -1/2    1/2  |  UUA    Leu L  (3/2, 1/2)^1    -3/2    1/2

CGC    Arg R  (3/2, 1/2)^2     3/2    1/2  |  UGC    Cys C  (3/2, 1/2)^2     1/2    1/2
CGU    Arg R  (1/2, 1/2)^2     1/2    1/2  |  UGU    Cys C  (1/2, 1/2)^2    -1/2    1/2
CGG    Arg R  (3/2, 1/2)^2     3/2   -1/2  |  UGG    Trp W  (3/2, 1/2)^2     1/2   -1/2
CGA    Arg R  (1/2, 1/2)^2     1/2   -1/2  |  UGA    Ter    (1/2, 1/2)^2    -1/2   -1/2

CAC    His H  (1/2, 1/2)^4     1/2    1/2  |  UAC    Tyr Y  (3/2, 1/2)^2    -1/2    1/2
CAU    His H  (1/2, 1/2)^4    -1/2    1/2  |  UAU    Tyr Y  (3/2, 1/2)^2    -3/2    1/2
CAG    Gln Q  (1/2, 1/2)^4     1/2   -1/2  |  UAG    Ter    (3/2, 1/2)^2    -1/2   -1/2
CAA    Gln Q  (1/2, 1/2)^4    -1/2   -1/2  |  UAA    Ter    (3/2, 1/2)^2    -3/2   -1/2

GCC    Ala A  (3/2, 3/2)       3/2    1/2  |  ACC    Thr T  (3/2, 3/2)       1/2    1/2
GCU    Ala A  (1/2, 3/2)^1     1/2    1/2  |  ACU    Thr T  (1/2, 3/2)^1    -1/2    1/2
GCG    Ala A  (3/2, 1/2)^1     3/2   -1/2  |  ACG    Thr T  (3/2, 1/2)^1     1/2   -1/2
GCA    Ala A  (1/2, 1/2)^1     1/2   -1/2  |  ACA    Thr T  (1/2, 1/2)^1    -1/2   -1/2

GUC    Val V  (1/2, 3/2)^2     1/2    1/2  |  AUC    Ile I  (3/2, 3/2)      -1/2    1/2
GUU    Val V  (1/2, 3/2)^2    -1/2    1/2  |  AUU    Ile I  (3/2, 3/2)      -3/2    1/2
GUG    Val V  (1/2, 1/2)^3     1/2   -1/2  |  AUG    Met M  (3/2, 1/2)^1    -1/2   -1/2
GUA    Val V  (1/2, 1/2)^3    -1/2   -1/2  |  AUA    Ile I  (3/2, 1/2)^1    -3/2   -1/2

GGC    Gly G  (3/2, 3/2)       3/2   -1/2  |  AGC    Ser S  (3/2, 3/2)       1/2   -1/2
GGU    Gly G  (1/2, 3/2)^1     1/2   -1/2  |  AGU    Ser S  (1/2, 3/2)^1    -1/2   -1/2
GGG    Gly G  (3/2, 3/2)       3/2   -3/2  |  AGG    Arg R  (3/2, 3/2)       1/2   -3/2
GGA    Gly G  (1/2, 3/2)^1     1/2   -3/2  |  AGA    Arg R  (1/2, 3/2)^1    -1/2   -3/2

GAC    Asp D  (1/2, 3/2)^2     1/2   -1/2  |  AAC    Asn N  (3/2, 3/2)      -1/2   -1/2
GAU    Asp D  (1/2, 3/2)^2    -1/2   -1/2  |  AAU    Asn N  (3/2, 3/2)      -3/2   -1/2
GAG    Glu E  (1/2, 3/2)^2     1/2   -3/2  |  AAG    Lys K  (3/2, 3/2)      -1/2   -3/2
GAA    Glu E  (1/2, 3/2)^2    -1/2   -3/2  |  AAA    Lys K  (3/2, 3/2)      -3/2   -3/2


Table 2: Data for vertebrates from GenBank Release 127.0 [15 December 2001]

No.  Biological species      Number of sequences  Number of codons

 1   Homo sapiens                   41504             18611700
 2   Mus musculus                   17286              8079821
 3   Rattus norvegicus               6578              3324518
 4   Gallus gallus                   2089              1019029
 5   Xenopus laevis                  1961               929562
 6   Bos taurus                      1694               764195
 7   Danio rerio                     1209               535583
 8   Oryctolagus cuniculus            864               441547
 9   Macaca fascicularis             1321               403875
10   Sus scrofa                       914               380357
11   Canis familiaris                 489               229526
12   Takifugu rubripes                259               152479
13   Ovis aries                       413               134027
14   Oncorhynchus mykiss              366               131431
15   Cricetulus griseus               217               109395
16   Rattus sp.                       265               106164
17   Pan troglodytes                  272                88272
18   Oryzias latipes                  183                85610
19   Macaca mulatta                   264                81673
20   Felis cattus                     177                66930
21   Equus caballus                   177                59932



Table 3: Sum of usage probabilities of codons P_C+A(XN) = P(XNC) + P(XNA). The number in the
first column denotes the biological species of Table 2. The amino acids are labelled by the
standard letter. Moreover P'_C+A(S) = P(UCA) + P(AGC).

No.  P_C+A(P)  P_C+A(A)  P_C+A(T)  P_C+A(S)  P_C+A(V)  P_C+A(L)  P_C+A(R)  P_C+A(G)  P'_C+A(S)

 1     0.60      0.63      0.64      0.60      0.35      0.33      0.51      0.59      0.65
 2     0.59      0.61      0.65      0.59      0.36      0.34      0.52      0.59      0.65
 3     0.59      0.62      0.65      0.60      0.37      0.34      0.52      0.59      0.65
 4     0.60      0.59      0.62      0.59      0.35      0.31      0.53      0.58      0.67
 5     0.60      0.60      0.62      0.56      0.38      0.33      0.50      0.58      0.62
 6     0.60      0.63      0.65      0.60      0.35      0.33      0.52      0.60      0.65
 7     0.53      0.56      0.61      0.56      0.34      0.32      0.55      0.63      0.63
 8     0.61      0.65      0.64      0.63      0.35      0.33      0.55      0.61      0.66
 9     0.60      0.63      0.63      0.60      0.37      0.35      0.50      0.58      0.65
10     0.61      0.64      0.65      0.62      0.36      0.33      0.53      0.61      0.67
11     0.61      0.62      0.64      0.60      0.38      0.34      0.52      0.59      0.65
12     0.58      0.58      0.61      0.59      0.38      0.32      0.55      0.59      0.64
13     0.61      0.63      0.66      0.60      0.35      0.35      0.55      0.61      0.67
14     0.62      0.61      0.68      0.60      0.38      0.33      0.53      0.57      0.66
15     0.61      0.62      0.66      0.58      0.37      0.33      0.51      0.59      0.64
16     0.58      0.62      0.66      0.59      0.37      0.34      0.52      0.59      0.66
17     0.62      0.54      0.72      0.59      0.29      0.35      0.58      0.58      0.69
18     0.56      0.59      0.63      0.60      0.35      0.31      0.55      0.63      0.68
19     0.61      0.61      0.65      0.61      0.34      0.34      0.50      0.59      0.64
20     0.61      0.63      0.64      0.60      0.38      0.34      0.52      0.60      0.64
21     0.58      0.63      0.66      0.62      0.37      0.34      0.53      0.61      0.68


Table 4: Mean value (x̄), standard deviation (σ) and their ratio for the probabilities P(XZN)
corresponding to the eight amino acids related to quartets or sextets, for the choice of
biological species of Table 2.

       P(CCU)  P(CCC)  P(CCA)  P(CCG)  P(ACU)  P(ACC)  P(ACA)  P(ACG)
x̄       0.28    0.33    0.26    0.13    0.23    0.39    0.26    0.13
σ       0.028   0.043   0.034   0.028   0.030   0.050   0.034   0.027
σ/x̄    10.0 %  12.8 %  13.3 %  22.3 %  13.1 %  13.0 %  13.0 %  21.4 %

       P(GCU)  P(GCC)  P(GCA)  P(GCG)  P(UCU)  P(UCC)  P(UCA)  P(UCG)
x̄       0.27    0.40    0.21    0.12    0.30    0.38    0.22    0.10
σ       0.026   0.046   0.035   0.029   0.027   0.036   0.026   0.020
σ/x̄     9.5 %  11.6 %  16.5 %  25.3 %   8.8 %   9.3 %  12.0 %  20.2 %

       P(GUU)  P(GUC)  P(GUA)  P(GUG)  P(CUU)  P(CUC)  P(CUA)  P(CUG)
x̄       0.17    0.26    0.10    0.47    0.15    0.25    0.08    0.52
σ       0.036   0.023   0.026   0.045   0.034   0.018   0.017   0.035
σ/x̄    20.9 %   9.1 %  25.8 %   9.5 %  22.6 %   7.2 %  20.6 %   6.7 %

       P(CGU)  P(CGC)  P(CGA)  P(CGG)  P(GGU)  P(GGC)  P(GGA)  P(GGG)
x̄       0.16    0.34    0.18    0.31    0.17    0.33    0.26    0.23
σ       0.042   0.039   0.026   0.043   0.029   0.034   0.033   0.032
σ/x̄    26.0 %  11.4 %  14.0 %  13.9 %  16.9 %  10.3 %  12.7 %  13.7 %



Table 5: Mean value (x̄), standard deviation (σ) and their ratio for the sums of probabilities
P_C+A, P_C+U, P_C+G corresponding to the eight amino acids related to quartets or sextets, for
the choice of biological species of Table 2. The amino acids are labelled by the standard letter.

       P_C+A(P)  P_C+A(A)  P_C+A(T)  P_C+A(S)  P_C+A(V)  P_C+A(L)  P_C+A(R)  P_C+A(G)
x̄       0.595     0.611     0.646     0.598     0.359     0.334     0.527     0.596
σ       0.020     0.027     0.024     0.016     0.020     0.012     0.020     0.015
σ/x̄     3.4 %     4.4 %     3.8 %     2.6 %     5.6 %     3.7 %     3.8 %     2.5 %

       P_C+U(P)  P_C+U(A)  P_C+U(T)  P_C+U(S)  P_C+U(V)  P_C+U(L)  P_C+U(R)  P_C+U(G)
x̄       0.613     0.672     0.614     0.687     0.430     0.401     0.506     0.507
σ       0.030     0.028     0.027     0.026     0.031     0.022     0.049     0.024
σ/x̄     4.9 %     4.2 %     4.4 %     3.8 %     7.2 %     5.4 %     9.7 %     4.8 %

       P_C+G(P)  P_C+G(A)  P_C+G(T)  P_C+G(S)  P_C+G(V)  P_C+G(L)  P_C+G(R)  P_C+G(G)
x̄       0.462     0.513     0.511     0.479     0.728     0.769     0.652     0.567
σ       0.058     0.058     0.063     0.046     0.056     0.048     0.055     0.058
σ/x̄    12.4 %    11.3 %    12.3 %     9.6 %     7.7 %     6.3 %     8.4 %    10.2 %



